CNTRLPLANE-3523: Retry on InvalidRouteTableId.NotFound during route creation #8747

Open
vismishr wants to merge 1 commit into openshift:main from vismishr:CNTRLPLANE-3523/fix-route-table-retry

Conversation

@vismishr

@vismishr vismishr commented Jun 17, 2026

Contributor

What this PR does / why we need it:

hcp create cluster aws fails with InvalidRouteTableID.NotFound because of a race with AWS eventual consistency: the CLI creates a route table and immediately tries to add routes before AWS has propagated the resource. With no retry logic to handle this transient error, the reported failure rate is 100%.

This PR adds retry with exponential backoff on InvalidRouteTableId.NotFound errors when creating routes after route table creation, matching the existing pattern already used in CreateVPCS3Endpoint.
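
For illustration, here is a minimal standard-library sketch of the retry shape described above. It is not the code from ec2.go: codedError, retryOnCode, and the simulated createRoute are hypothetical stand-ins, and the real change relies on retry.OnError with the file's existing retryBackoff.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// codedError is a hypothetical stand-in for an AWS API error carrying an error code.
type codedError struct{ code string }

func (e *codedError) Error() string { return e.code }

// retryOnCode calls fn up to steps times. Between attempts it sleeps for a
// delay that starts at base and is multiplied by factor after each wait.
// It retries only while fn returns a codedError whose code equals retriableCode.
func retryOnCode(steps int, base time.Duration, factor float64, retriableCode string, fn func() error) error {
	delay := base
	var err error
	for attempt := 1; attempt <= steps; attempt++ {
		err = fn()
		if err == nil {
			return nil
		}
		var ce *codedError
		if !errors.As(err, &ce) || ce.code != retriableCode {
			return err // not the transient error: fail immediately
		}
		if attempt < steps {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * factor)
		}
	}
	return err
}

func main() {
	calls := 0
	// Simulated CreateRoute: the route table is not yet visible for the first two calls.
	createRoute := func() error {
		calls++
		if calls < 3 {
			return &codedError{code: "InvalidRouteTableId.NotFound"}
		}
		return nil
	}
	// Millisecond base delay so the example finishes quickly.
	err := retryOnCode(5, time.Millisecond, 3, "InvalidRouteTableId.NotFound", createRoute)
	fmt.Println("calls:", calls, "err:", err)
}
```

The simulated call succeeds on the third attempt, which is the behaviour the PR expects once AWS has propagated the new route table.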

Files changed:

  • cmd/infra/aws/ec2.go: CreatePrivateRouteTable and CreatePublicRouteTable

Changes:

  • CreatePrivateRouteTable: Added invalidRouteTableID to the existing retriable error check alongside invalidNATGatewayError
  • CreatePublicRouteTable: Wrapped the CreateRoute call in retry.OnError with backoff on invalidRouteTableID

Both use the existing retryBackoff (5 steps, 3s base, 3x factor) already defined and proven in production via CreateVPCS3Endpoint.
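
To make the quoted backoff parameters concrete, the sketch below computes the waits they imply. It assumes that the 5 steps count attempts (so there are four waits between them) and that there is no jitter or cap; neither assumption is confirmed by this PR, and backoffDelays is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelays returns the waits between attempts for a backoff with the
// given number of steps (attempts), base duration, and multiplicative factor.
// Assumes no jitter and no cap; there is one fewer wait than attempts.
func backoffDelays(steps int, base time.Duration, factor float64) []time.Duration {
	var delays []time.Duration
	d := base
	for i := 1; i < steps; i++ {
		delays = append(delays, d)
		d = time.Duration(float64(d) * factor)
	}
	return delays
}

func main() {
	var total time.Duration
	for i, d := range backoffDelays(5, 3*time.Second, 3) {
		total += d
		fmt.Printf("wait %d: %s\n", i+1, d)
	}
	fmt.Println("total wait:", total)
}
```

Under those assumptions the waits are 3s, 9s, 27s, and 81s, so a call that never succeeds gives up after roughly two minutes of waiting.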

Which issue(s) this PR fixes:

Fixes https://redhat.atlassian.net/browse/CNTRLPLANE-3523

Special notes for your reviewer:

  • The retry is idempotent — CreateRoute with the same destination CIDR on the same route table either succeeds or returns RouteAlreadyExists.
  • The retry is narrowly scoped to only InvalidRouteTableId.NotFound — any other error fails immediately.
  • The exact same pattern (retry on invalidRouteTableID with retryBackoff) is already used in CreateVPCS3Endpoint in the same file.
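
The narrow scoping in the notes above can be sketched as a predicate over error codes. The apiError type and hasCode helper are hypothetical, and the constant values are assumed from the error names quoted in this PR; the actual check in ec2.go may be written differently.

```go
package main

import (
	"errors"
	"fmt"
)

// Constant names mirror those mentioned in the PR; the values are assumptions.
const (
	invalidRouteTableID    = "InvalidRouteTableId.NotFound"
	invalidNATGatewayError = "InvalidNatGatewayID.NotFound"
)

// apiError is a hypothetical stand-in for an AWS API error with a code.
type apiError struct{ code, msg string }

func (e *apiError) Error() string { return e.code + ": " + e.msg }

// hasCode reports whether err (or an error it wraps) is an apiError whose
// code matches one of codes. Anything else is treated as non-retriable.
func hasCode(err error, codes ...string) bool {
	var ae *apiError
	if !errors.As(err, &ae) {
		return false
	}
	for _, c := range codes {
		if ae.code == c {
			return true
		}
	}
	return false
}

func main() {
	// Private route table: retry on either missing resource.
	privateRetriable := func(err error) bool {
		return hasCode(err, invalidRouteTableID, invalidNATGatewayError)
	}
	// Public route table: retry only on the missing route table.
	publicRetriable := func(err error) bool {
		return hasCode(err, invalidRouteTableID)
	}

	natErr := fmt.Errorf("create route: %w", &apiError{code: invalidNATGatewayError, msg: "not found"})
	fmt.Println(privateRetriable(natErr))
	fmt.Println(publicRetriable(natErr))
	fmt.Println(publicRetriable(&apiError{code: "RouteAlreadyExists", msg: "route exists"}))
}
```

Matching on an explicit list of codes keeps every other failure, including RouteAlreadyExists, on the fail-fast path.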

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved resilience of cloud infrastructure provisioning by enhancing error handling for network route creation. The system now more effectively retries on transient API failures, reducing deployment interruptions and improving overall reliability.

AWS eventual consistency can cause CreateRoute to fail with
InvalidRouteTableId.NotFound immediately after CreateRouteTable
returns. Add retry with backoff for this error, matching the
existing pattern in CreateVPCS3Endpoint.
@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 17, 2026
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2026
@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Jun 17, 2026


@vismishr: This pull request references CNTRLPLANE-3523 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "5.0.0" version, but no target version was set.



@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 4fb39afc-008b-444b-90d1-5dbe92f1c532

📥 Commits

Reviewing files that changed from the base of the PR and between 7507291 and fef8eee.

📒 Files selected for processing (1)
  • cmd/infra/aws/ec2.go

📝 Walkthrough

Walkthrough

In cmd/infra/aws/ec2.go, two functions receive targeted retry improvements for AWS route creation. CreatePrivateRouteTable expands its existing retry.OnError predicate to treat InvalidRouteTableId.NotFound as retriable alongside the already-handled InvalidNatGatewayID.NotFound. CreatePublicRouteTable gains a new retry.OnError wrapper around its internet-gateway CreateRoute call, with a predicate that retries exclusively on InvalidRouteTableId.NotFound; previously this call had no targeted retry behavior.

🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly summarizes the main change: adding retry logic for InvalidRouteTableId.NotFound errors during route creation in AWS EC2 functions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo tests exist in this PR. Changes are limited to production code in cmd/infra/aws/ec2.go. Repository uses Go's standard testing framework, not Ginkgo. Check is not applicable. |
| Test Structure And Quality | ✅ Passed | This PR contains no test code changes. The check for Ginkgo test structure is not applicable since only cmd/infra/aws/ec2.go was modified, and the PR explicitly notes no unit tests were included. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR modifies AWS EC2 infrastructure provisioning code in cmd/infra/aws/ec2.go, not Kubernetes deployment manifests, operators, or controllers. No scheduling constraints are introduced. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PR modifies AWS infrastructure code (ec2.go) to add retry logic for route table creation. No Ginkgo e2e tests are added in this PR, so the IPv6/disconnected network test compatibility check is not... |
| No-Weak-Crypto | ✅ Passed | No weak cryptography patterns (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or insecure secret/token comparisons found in cmd/infra/aws/ec2.go. |
| Container-Privileges | ✅ Passed | PR modifies AWS infrastructure code (cmd/infra/aws/ec2.go), not K8s/container manifests. No container security specs or K8s configurations are present in the changes. |
| No-Sensitive-Data-In-Logs | ✅ Passed | No sensitive data logging found. The PR adds retry logic for route creation with logging statements that only expose AWS resource IDs (rtb-*, nat-*, igw-*), which are non-sensitive public identifie... |


@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vismishr
Once this PR has been reviewed and has the lgtm label, please assign devguyio for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/platform/aws PR/issue for AWS (AWSPlatform) platform and removed do-not-merge/needs-area labels Jun 17, 2026
@vismishr vismishr marked this pull request as ready for review June 17, 2026 02:58
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 17, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and enxebre June 17, 2026 02:58
@codecov

codecov Bot commented Jun 17, 2026


Codecov Report

❌ Patch coverage is 0% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.79%. Comparing base (7507291) to head (fef8eee).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| cmd/infra/aws/ec2.go | 0.00% | 15 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8747      +/-   ##
==========================================
- Coverage   41.79%   41.79%   -0.01%     
==========================================
  Files         759      759              
  Lines       94037    94047      +10     
==========================================
  Hits        39304    39304              
- Misses      51983    51993      +10     
  Partials     2750     2750              

| Files with missing lines | Coverage Δ |
| --- | --- |
| cmd/infra/aws/ec2.go | 0.00% <0.00%> (ø) |

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 35.10% <0.00%> (-0.02%) ⬇️ |
| cpo-hostedcontrolplane | 44.10% <ø> (ø) |
| cpo-other | 43.45% <ø> (ø) |
| hypershift-operator | 51.87% <ø> (ø) |
| other | 31.56% <ø> (ø) |


@openshift-ci

openshift-ci Bot commented Jun 17, 2026

Contributor

@vismishr: all tests passed!


@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented Jun 17, 2026



Test Failure Analysis Complete

Job Information

  • Prow Job: codecov/patch and codecov/project (GitHub Check Runs, not Prow CI jobs)
  • Build ID: Check Run IDs 81811339220 (patch) and 81811337780 (project)
  • PR: #8747 CNTRLPLANE-3523: Retry on InvalidRouteTableId.NotFound during route creation
  • File Changed: cmd/infra/aws/ec2.go

Test Failure Analysis

Error

codecov/patch: 0.00% of diff hit (target 41.79%) — 15 lines in changes missing coverage
codecov/project: 41.79% (-0.01%) compared to base commit 7507291

Summary

Both codecov checks fail because the PR adds 15 new/modified lines to cmd/infra/aws/ec2.go (retry logic wrapping CreateRoute calls with retry.OnError for InvalidRouteTableId.NotFound), and none of these lines are covered by unit tests. The file ec2.go has 0% test coverage on the main branch; there is no ec2_test.go file. The codecov/patch check fails because 0 of 15 diff lines are hit (target is 41.79%), and codecov/project fails because overall project coverage dropped by 0.01% (10 net new uncovered lines out of ~94k total).

These are non-blocking informational checks: codecov is not listed in the required status checks for the main branch, and the codecov action bump PR #8727 merged without any codecov results. The PR itself is a low-risk, narrowly scoped retry addition that reuses an existing proven pattern (retryBackoff + invalidRouteTableID) already used in CreateVPCS3Endpoint in the same file.

Root Cause

The root cause is straightforward: the modified file cmd/infra/aws/ec2.go has zero unit test coverage on the main branch — no ec2_test.go exists. The file contains AWS infrastructure provisioning functions (CreatePrivateRouteTable, CreatePublicRouteTable, CreateVPC, etc.) that make real AWS API calls, making them difficult to unit test without mocking the AWS EC2 client.

Specifically:

  1. codecov/patch failure: The patch coverage check compares new/modified lines against the project-wide target (41.79%). Since 0 of 15 changed lines are exercised by any test, patch coverage is 0.00%, falling below the default Codecov threshold.
  2. codecov/project failure: The project added 10 net new uncovered lines (94,037 → 94,047 total lines, misses 51,983 → 51,993), causing overall coverage to drop from 41.80% to 41.79% — a 0.01% decrease. Codecov's default project check fails when coverage decreases compared to the base commit.

The codecov.yml configuration has no explicit coverage.status section, so Codecov applies its defaults: project coverage must not decrease, and patch coverage must meet the project target. Both conditions fail for this PR.
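
The coverage arithmetic above can be reproduced from the hit and line counts in the Codecov report. This is only a sketch of the percentage calculation; Codecov's own rounding and display rules are not modelled, and the coverage helper is hypothetical.

```go
package main

import "fmt"

// coverage returns hits as a percentage of lines (0 when there are no lines).
func coverage(hits, lines int) float64 {
	if lines == 0 {
		return 0
	}
	return 100 * float64(hits) / float64(lines)
}

func main() {
	// Figures taken from the Codecov report in this PR.
	base := coverage(39304, 94037) // main at 7507291
	head := coverage(39304, 94047) // PR head fef8eee: same hits, 10 more lines
	patch := coverage(0, 15)       // 0 of 15 changed lines hit

	fmt.Printf("base:  %.3f%%\n", base)
	fmt.Printf("head:  %.3f%%\n", head)
	fmt.Printf("delta: %.3f points\n", head-base)
	fmt.Printf("patch: %.2f%%\n", patch)
}
```

Because hits are unchanged while the line count grows by 10, the project figure can only move down, and the patch figure is zero regardless of the target.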

Critically, these checks are not required for merge. The repository's branch protection does not list codecov/patch or codecov/project as required status checks. Recently merged PR #8744 shows both codecov checks passing, but the codecov action bump PR #8727 merged without codecov results at all — confirming they are advisory.

Recommendations
  1. These failures are safe to ignore for merging — codecov checks are not required status checks on the main branch. The PR can proceed through the normal review/approval/merge process.

  2. If the team wants to fix the codecov status (optional), add unit tests for the retry logic in CreatePrivateRouteTable and CreatePublicRouteTable. This would require:

    • Creating cmd/infra/aws/ec2_test.go
    • Introducing an interface for the EC2 client (or using the existing ec2iface / mock pattern from support/awsapi/)
    • Testing that isRetriable() returns true for InvalidRouteTableId.NotFound and false for other errors
    • Testing that retry.OnError retries correctly on retriable errors
  3. Consider adding cmd/infra/aws/ec2.go to the codecov ignore list in codecov.yml if the team decides this infrastructure-provisioning code should not be subject to coverage requirements (similar to how cmd/infra/azure/types.go and cmd/infra/gcp/constants.go are already ignored).

  4. No code changes needed in the PR itself — the retry logic is correct, idempotent, and follows the exact same pattern already used in CreateVPCS3Endpoint in the same file.
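
As a rough idea of what the optional unit tests in recommendation 2 could exercise, the sketch below drives a retry loop against a fake client. Everything here is hypothetical: routeCreator, fakeEC2, codeError, and createRouteWithRetry are illustrative names, the real EC2 client interface has a different signature, and backoff sleeps are omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// routeCreator is a hypothetical, narrowed-down interface for the one call under test.
type routeCreator interface {
	CreateRoute(routeTableID string) error
}

// codeError is a hypothetical stand-in for an AWS API error with a code.
type codeError struct{ code string }

func (e *codeError) Error() string { return e.code }

// fakeEC2 returns failCode for the first `failures` calls, then succeeds.
type fakeEC2 struct {
	failures int
	failCode string
	calls    int
}

func (f *fakeEC2) CreateRoute(string) error {
	f.calls++
	if f.calls <= f.failures {
		return &codeError{code: f.failCode}
	}
	return nil
}

// isRetriable is true only for the route-table-not-found code.
func isRetriable(err error) bool {
	var ce *codeError
	return errors.As(err, &ce) && ce.code == "InvalidRouteTableId.NotFound"
}

// createRouteWithRetry tries up to attempts times, stopping on success or
// on the first non-retriable error. No sleeping between attempts here.
func createRouteWithRetry(c routeCreator, id string, attempts int) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = c.CreateRoute(id); err == nil || !isRetriable(err) {
			return err
		}
	}
	return err
}

func main() {
	cases := []struct {
		name      string
		fake      *fakeEC2
		wantCalls int
		wantErr   bool
	}{
		{"succeeds after transient not-found", &fakeEC2{failures: 2, failCode: "InvalidRouteTableId.NotFound"}, 3, false},
		{"other error fails immediately", &fakeEC2{failures: 1, failCode: "UnauthorizedOperation"}, 1, true},
		{"gives up after all attempts", &fakeEC2{failures: 10, failCode: "InvalidRouteTableId.NotFound"}, 5, true},
	}
	for _, tc := range cases {
		err := createRouteWithRetry(tc.fake, "rtb-example", 5)
		ok := tc.fake.calls == tc.wantCalls && (err != nil) == tc.wantErr
		fmt.Printf("%s: ok=%v\n", tc.name, ok)
	}
}
```

In a real ec2_test.go the same cases would sit in a table-driven test using the testing package, with the fake injected wherever the production code accepts its EC2 client.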

Evidence
| Evidence | Detail |
| --- | --- |
| codecov/patch output | 0.00% of diff hit (target 41.79%); 15 lines missing coverage |
| codecov/project output | 41.79% (-0.01%) compared to 7507291; coverage decreased |
| ec2.go base coverage | 0.00% <0.00%> (ø): file had 0% coverage before this PR (no change) |
| ec2_test.go exists? | No; no unit test file exists for cmd/infra/aws/ec2.go |
| Test files in cmd/infra/aws/ | destroy_iam_test.go, destroy_test.go, iam_test.go, route53_test.go; none cover ec2.go |
| Lines added/modified | 10 net new lines (94,037 → 94,047), 15 diff lines total |
| Coverage impact | Hits unchanged (39,304), Misses +10 (51,983 → 51,993), Partials unchanged (2,750) |
| Required status check? | No; codecov is not in branch protection required checks |
| Recently merged with codecov pass | PR #8744 merged 2026-06-17 with codecov/patch and codecov/project both passing |
| codecov.yml config | No coverage.status section → Codecov defaults apply (no decrease allowed, patch must meet target) |
| PR scope | 2 functions modified in 1 file; retry logic reuses existing retryBackoff + invalidRouteTableID pattern from CreateVPCS3Endpoint |


Labels

area/cli Indicates the PR includes changes for CLI area/platform/aws PR/issue for AWS (AWSPlatform) platform jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
